Extensible surface for consuming information extraction services

ABSTRACT

Representing structured data extracted from unstructured data in a fashion allowing querying using relational database concepts. A method includes receiving user input specifying one or more database views. The method further includes receiving user input specifying an information extraction technique, such as an extraction workflow. The method further includes receiving user input specifying a corpus of data. The extraction technique is applied to the corpus of data to produce the one or more database views. These views can then be queried or operated on using database tools.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of, claims priority to, and the benefit of U.S. patent application Ser. No. 13/040,939, entitled “EXTENSIBLE SURFACE FOR CONSUMING INFORMATION EXTRACTION SERVICES”, which was filed on Mar. 4, 2011, and which is to issue on Jun. 23, 2015, as U.S. Pat. No. 9,064,004. The parent of this continuation, U.S. patent application Ser. No. 13/040,939, is incorporated by reference in its entirety herein.

BACKGROUND

Background and Relevant Art

Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc.

Computing systems are often used for information management. In particular, computing systems can be used to provide information to users. However, information may be stored and made available to users in a number of disparate ways. For example, computing systems may implement relational database management systems (RDBMSs) to store and organize data as structured data. Structured data is data that is organized semantically. Further, similar data entities are often grouped together by relationships in relational databases or by typed classes in object-oriented systems. An example of a simple RDBMS is just a table with columns and rows. The columns describe categories of data while the rows store instances of the categories. RDBMSs facilitate efficient retrieval of data. For example, a simple table may have a column for cities and a column for current temperatures. To find the temperature at a given city, the city column is identified, and the city of interest is searched for in the city column and found at a particular row. The temperature column is then identified, and the value in that same row of the temperature column gives the temperature at the city of interest. Thus, typically data in an RDBMS is structured data.

Another type of data is unstructured data. Unstructured data is typically not organized in a way that allows a computing system to immediately identify the type or relational structure of the data. For example, a text document may contain the following data: “The temperature in Rio de Janeiro right now is 82 degrees.” However, Rio de Janeiro is not structured as a city type and 82 is not structured as a temperature type, nor is there a formalized structure map for Rio de Janeiro and 82 degrees. Additionally, the text document may contain a number of sentences describing various temperatures in various cities around the world. Determining the temperature of a given city using unstructured data in the text file may be more difficult for automated computing systems than using a structured data database where data can be searched based on categories.

The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.

BRIEF SUMMARY

One embodiment illustrated herein is a method practiced in a computing environment. The method includes acts for representing structured data extracted from unstructured data in a fashion allowing querying using relational database concepts. The method includes receiving user input specifying one or more database views. The method further includes receiving user input specifying an information extraction technique, such as an extraction workflow. The method further includes receiving user input specifying a corpus of data. The extraction technique is applied to the corpus of data to produce the one or more database views. These views can then be queried or operated on using database tools.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates processing a corpus of unstructured data through a workflow to produce one or more views;

FIG. 2 illustrates a collection of views produced by an example workflow; and

FIG. 3 illustrates a method of representing structured data extracted from unstructured data in a fashion allowing querying using relational database concepts.

DETAILED DESCRIPTION

Some embodiments described herein may implement a user surface for representing extraction of unstructured data into structured data in an RDBMS. Some embodiments include functionality for representing extractions that operate over entire corpuses of documents represented as rowsets instead of only individual documents. Some embodiments implement functionality for exposing complex, independently queryable extraction output such as entity-relationship graphs. Some embodiments implement functionality for exposing extraction output through well-understood and well-supported RDBMS concepts such as tables, views, etc. In particular, embodiments may expose extraction results as views or schemas containing views, such that these results can represent complex structures such as graphs and are independently queryable. Some embodiments may implement interfaces and extraction methods to maintain the same feel for applying an extraction, no matter the specifics of the extraction, and thus are extensible to new extractions in a database.

Data extraction systems may be used to extract and categorize data from unstructured data to allow for automated systems to do categorized data searches on the data. These extraction systems can determine, or attempt to determine, type or relationship information, such that unstructured data can be organized into structured data.

Users increasingly use RDBMSs to store unstructured documents such as files, images, or large text values. Some methods for managing such data implement information extraction. Information extraction includes processes that input unstructured documents, then output structured data describing them. Some examples include, but are not limited to, extracting ID3 metadata from MP3 files, extracting entities and relationships from text, and recognizing faces in images or videos. Performing such extraction in the database is valuable for many reasons, such as keeping data-heavy processing near data and leveraging existing management features like backup/restore, replication, security, etc.

RDBMSs may support some built-in extraction. This falls into two main categories: indexes and special data types. For instance, full-text and XML indexing take text as input and output structured indexes. Likewise, special data types for multimedia perform extraction through functions, for example to extract color data from images.

Referring now to FIG. 1, an example is illustrated. FIG. 1 illustrates a corpus of data 102. The corpus of data 102 includes unstructured data. For example, the corpus of data 102 may include one or more unstructured text documents, media files, images, videos, biometric data, etc. The unstructured data includes data that, at an entity level, is not organized semantically in that it does not have a formalized type and/or is not in a formal entity level relationship where one entity is formally related to another, such as through a graph, tree and/or other relationship structure. As noted, the corpus of data can be a single file or document, or a collection of files and/or documents. In some embodiments, a single file or document may be used for ad-hoc extraction and searching as will be explained in more detail below. In other embodiments, a single file or document, or a collection of files and/or documents can be extracted to a database or other structure for on-going searching and/or access beyond a single ad-hoc instance.

The corpus of data 102 may be fed into an information extraction workflow 104. The information extraction workflow 104 defines the way that data is extracted from the corpus of data 102 to organize the data in the corpus of data 102 into structured data. Examples of information extraction workflows are now illustrated. While specific examples are illustrated, it should be appreciated that the examples are not exhaustive of extraction techniques, and other extraction techniques may be used.

In some embodiments, the extraction workflow may comprise a phrase semantic extraction technique. In particular, embodiments may include modules that are able to determine metadata about a phrase or words in a phrase based on the semantic environment of the word or phrase. For example, relationships may be determined by proximity of words to one another. For example, if the terms Microsoft and Excel are found next to each other across a corpus of documents, phrase semantic analysis can determine that these two terms are related.

Dictionary or lexical definitions may be used to create types or relationships for words or phrases. For example, a lexical definition of Rio de Janeiro would identify it as a city, and thus metadata could be extracted classifying Rio de Janeiro as a city type. In another example, a document may have the text “Jan. 13, 2011.” A lexical look-up of January can be used to determine that it is a month used in determining dates, and thus, a determination can be made that this text is a date type.
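By way of illustration only, the lexical look-up described above might be sketched as follows. The lexicon contents, the date shape, and the names TYPE_LEXICON, MONTH_TOKENS, DATE_SHAPE, and classify_terms are assumptions made for this sketch and are not part of any described embodiment.

```python
import re

# Hypothetical lexicon entries; a real system would load a much larger dictionary.
TYPE_LEXICON = {"rio de janeiro": "city"}
MONTH_TOKENS = {"jan.", "january"}

# A capitalized token, a day number, a comma, and a four-digit year.
DATE_SHAPE = re.compile(r"\b([A-Z][a-z]+\.?)\s+(\d{1,2}),\s+(\d{4})")


def classify_terms(text):
    """Return (phrase, type) pairs recognized through lexical look-up."""
    found = []
    lowered = text.lower()
    for term, term_type in TYPE_LEXICON.items():
        if term in lowered:
            found.append((term, term_type))
    for match in DATE_SHAPE.finditer(text):
        # Only treat the span as a date if its first token is a known month.
        if match.group(1).lower() in MONTH_TOKENS:
            found.append((match.group(0), "date"))
    return found
```

In this sketch the month look-up is what promotes the span “Jan. 13, 2011” to a date type; a capitalized word that is not a known month would be left untyped.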

In some embodiments, the extraction workflow may include relationship identification functionality. For example, a text document may contain the phrase “city of Rio de Janeiro.” Based on the syntax of the phrase, it can be determined that Rio de Janeiro is an object of type “city”. In another example, a text document may include the text “Author: Robert Smith.” Based on common syntax it can be extracted that “Robert Smith” is an object of type “author”. Syntax and relationship identification can be based on user input identifying relationships and/or learning based on experience with identifying relationships. For example, user input may be received wherein a user identifies a relationship in a phrase, such as by identifying the object and identifying the type. For example, in the examples illustrated above, the user may identify the word city to represent a type and Rio de Janeiro to represent an object of type city. A subsequent phrase with similar syntax could be dissected to extract metadata for creating structured data.
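As a minimal, non-limiting sketch of the user-labeled approach described above, a rule may be derived from one labeled phrase and then applied to a later phrase with similar syntax. The function names learn_rule and apply_rule, and the phrase “Editor: Jane Doe”, are hypothetical and chosen only for illustration.

```python
import re


def learn_rule(example, type_word, object_text):
    """Derive a regular expression from one user-labeled example phrase."""
    # Escape the literal phrase, then swap the labeled spans for capture groups.
    pattern = re.escape(example)
    pattern = pattern.replace(re.escape(type_word), r"(?P<type>\w+)", 1)
    pattern = pattern.replace(re.escape(object_text), r"(?P<object>.+)", 1)
    return re.compile("^" + pattern + "$")


def apply_rule(rule, phrase):
    """Return {'type': ..., 'object': ...} when the phrase fits the rule, else None."""
    match = rule.match(phrase)
    return match.groupdict() if match else None


# The user labels "Author" as the type and "Robert Smith" as the object.
author_rule = learn_rule("Author: Robert Smith", "Author", "Robert Smith")
extracted = apply_rule(author_rule, "Editor: Jane Doe")
```

The sketch generalizes only the two labeled spans and keeps the remaining syntax literal, so it recognizes phrases of the same shape and nothing else; learning based on experience, as mentioned above, would need a richer model.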

In some embodiments, the extraction workflow may comprise a property promotion. For example, a music file, such as an MP3 file, may include metadata in the MP3 file. Such metadata may define the artist, song title, song length, etc. This metadata can be promoted to structured data.

In some embodiments, the extraction workflow may comprise an entity recognition or entity extraction workflow. For example, a document may contain a list of company names. A workflow may be designed to identify company names as company names. This can be used to structure data in the document by either type or on a relationship basis.

In some embodiments, the extraction workflow may comprise entity disambiguation. For example, a workflow may encounter different data in one or more documents for Pedro DeRose, and Dr. DeRose, and Mr. DeRose. The workflow may be able to determine that each of these data points represents the same person.
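One deliberately naive way to picture such disambiguation is to drop honorifics and group mentions by surname. This heuristic, the HONORIFICS set, and the function names are assumptions made for this sketch; a practical workflow would likely use additional context, since two different people can share a surname.

```python
# Hypothetical set of honorifics to ignore when comparing mentions.
HONORIFICS = {"dr.", "mr.", "mrs.", "ms.", "prof."}


def surname_key(mention):
    """Use the last non-honorific token, lowercased, as the grouping key."""
    tokens = [t for t in mention.split() if t.lower() not in HONORIFICS]
    return tokens[-1].lower() if tokens else ""


def group_mentions(mentions):
    """Map each surname key to the list of mentions that share it."""
    groups = {}
    for mention in mentions:
        groups.setdefault(surname_key(mention), []).append(mention)
    return groups


groups = group_mentions(["Pedro DeRose", "Dr. DeRose", "Mr. DeRose"])
```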

In some embodiments, the extraction workflow may comprise pattern recognition. One such example is illustrated in facial recognition in images. For example, in one embodiment, pattern recognition could simply note that a face appears. Alternatively or additionally, embodiments could identify the face based on a dictionary of faces.

As illustrated in FIG. 1, passing the corpus of data 102 through the extraction workflow 104 can be used to produce one or more database views 106. The database views may be ad hoc views over which a single query or single set of queries can be run, or more persistent for use over an extended period of time for an extended number or set of queries.

The views can represent one of a number of different forms of data, including tables, graphs, etc. In some embodiments, a collection of views may represent this data. For example, a schema in SQL Server available from Microsoft Corporation is an example of such a collection of views. When outputting multiple views, the extraction workflow may group them in such a collection.

In addition to the extraction technique being used to produce views, the extraction technique may further be used to produce procedures. These procedures may define methods of operating on, managing, or refreshing the content of the one or more views. The procedures may be accessible using a database system used to operate on the views.

At a high level, users can start with a table of unstructured documents. In FIG. 1, this is illustrated as the corpus of data 102. The corpus of data may, in some embodiments, include several different documents. The user specifies an extraction-related service to perform extractions, such as extracting metadata properties, extracting entities and relationships, using phrase semantics for extraction, etc. This is illustrated as an example in FIG. 1 by the extraction workflow 104. The workflow 104 represents a specific type of extraction specified by a user. The user also chooses where they want to expose the structured results. This is illustrated in FIG. 1 by the view 106.

In some embodiments, different methods of choosing and exposing an extraction may be the same, or very similar, no matter the specific extraction performed so as to create a generalized process for structuring unstructured data. In particular, a user may be able to use a standardized user interface or API to invoke different extractions.

From this high-level illustration, various intermediate concepts are further explored. The first concept is the concept of stored documents. This relates to how documents are stored in a database prior to extraction. A stored document may be a document that is a row in a table or view. The row, in this example, has a unique id, which may be part of a unique key on the table. The row may have a number of columns with text or binary, which may be equivalent to named sections of a document. A corpus of documents may be a rowset, such as a table or view.
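The stored-document concept can be pictured with any relational engine. The following sketch uses Python's built-in sqlite3 module rather than SQL Server, purely as an assumption for illustration; the table name product_manual, its columns, and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each row is one stored document: a unique id plus text and binary columns
# that play the role of named sections of the document.
conn.execute(
    "CREATE TABLE product_manual ("
    " manual_id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " body TEXT,"
    " file_content BLOB)"
)
conn.executemany(
    "INSERT INTO product_manual (manual_id, title, body, file_content)"
    " VALUES (?, ?, ?, ?)",
    [
        (1, "Setup guide", "Plug in the device.", b"\x00\x01"),
        (2, "Safety notes", "Keep away from water.", b"\x02\x03"),
    ],
)

# The corpus of documents is simply the rowset.
corpus = conn.execute(
    "SELECT manual_id, title FROM product_manual ORDER BY manual_id"
).fetchall()
```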

The second concept is the concept of ad-hoc documents. This concept is directed to how documents are represented when they are not stored, but instead provided for a single query. In some example embodiments directed for use with SQL Server available from Microsoft Corporation of Redmond, Washington, a SQL Server CLR type, called Document, can be used to represent a document specified as a URI. For example: DECLARE @d DOCUMENT = ‘file:// . . . ’.

A third concept is the concept of an extraction workflow. An extraction workflow defines and names a process for extracting structure from unstructured data. Some embodiments may be implemented where users can create their own extraction workflows. Some embodiments may have, additionally or alternatively, system-defined workflows. For example, the system could define a property_promotion workflow that extracts metadata from files, or an entity_relationship workflow that extracts named entities and relationships. Each workflow is a named black-box that exposes what configuration options it accepts, and what other extraction workflows should exist before it can be created. This may be exposed through system catalogs in the database.

A fourth concept is the concept of extraction invocation. Extraction invocation includes applying an extraction workflow to a particular corpus of documents. The invocation includes specifying the configuration options available for that workflow, specifying the update policy for how the extraction output should be updated when the corpus changes (e.g., automatic, manual), and an existing extraction output to build on when one is required. Thus, the invocation can be seen as the creation of the extraction pipeline that will process documents and produce output using the workflow.

In some embodiments, to represent an invocation, a clause may be used. The following illustrates an example of a clause that may be defined:

    USING EXTRACTION extraction_workflow_name
    ON document_table(document_columns)
    WITH configuration_options
    REFERENCES existing_extraction_output

The preceding illustrates a very specific example of an invocation clause that may be accepted by a system, and alternative clauses or other methods of invocation may be used. USING EXTRACTION is used to define one or more extraction workflows to operate on a corpus of data. Here, extraction_workflow_name represents a user-specified extraction workflow. ON is used to define the corpus of data. WITH is used to define various configuration options, such as a dictionary to use for dictionary-based extractions, or a set of stop-words to ignore in the input documents. REFERENCES is used to define an existing extraction output to build on. For example, an extraction workflow that identifies relationships between entities may build on the output of an earlier extraction workflow that extracted these entities. Here, REFERENCES would point to the output of this earlier extraction workflow.
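For readers who prefer to see the four parts of the invocation as data, the following sketch models them as a small Python record and renders the clause text described above. The class ExtractionInvocation and its field names are hypothetical; the sketch does not parse or execute the clause and is not an API of any database product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExtractionInvocation:
    """Hypothetical in-memory model of one extraction invocation."""
    workflow: str                      # USING EXTRACTION
    table: str                         # ON: the corpus rowset
    columns: List[str]                 # ON: the document columns
    options: Dict[str, str] = field(default_factory=dict)  # WITH
    references: Optional[str] = None   # REFERENCES: earlier output to build on

    def render(self) -> str:
        lines = [
            f"USING EXTRACTION {self.workflow}",
            f"ON {self.table}({', '.join(self.columns)})",
        ]
        if self.options:
            rendered = ", ".join(f"{k}={v}" for k, v in self.options.items())
            lines.append(f"WITH {rendered}")
        if self.references:
            lines.append(f"REFERENCES {self.references}")
        return "\n".join(lines)


invocation = ExtractionInvocation(
    workflow="Phrase_Semantics",
    table="Production.ProductDescription",
    columns=["Description"],
    options={"STOPLIST": "myStoplist"},
)
clause_text = invocation.render()
```

Optional parts are omitted from the rendered text when they are not supplied, mirroring the point that REFERENCES is only needed when a workflow builds on earlier output.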

Some embodiments may implement and use ad-hoc invocation. Ad-hoc invocation applies an extraction workflow to a particular ad-hoc document. However, some extractions use a corpus including multiple documents for context. For example, consider a workflow that extracts key concepts from text using statistical analysis of phrase frequencies in a corpus. Such an extraction obtains better results using a corpus including multiple documents for context to extract key phrases from each individual document. Thus, an ad-hoc extraction can specify an existing extraction output created by an extraction over an existing corpus as a basis. As with non-ad-hoc invocations, embodiments may use a variation on the above clause for ad-hoc invocations. The following illustrates a very specific variation of the above illustrated invocation.

    USING EXTRACTION statistically_key_phrases
    ON ad-hoc document
    BASIS existing_extraction_output

In this example, key phrases would be extracted from the ad-hoc document not as an individual document, but rather as if it were part of the corpus used to produce the existing extraction output.

Extraction output represents the structured output of invoking the extraction workflow on a corpus of documents. This output may be independently queryable. However, it is derived data, in the sense that it comes from applying a process over base data. In an RDBMS, the concept that represents independently queryable derived data may be a view. Thus, the output of an extraction, in some embodiments, is exposed as a view. This view can be persisted using an appropriate invocation, or used for a single ad-hoc query through an ad-hoc invocation in an ad-hoc command. For example, a WITH command is an ad-hoc command used in SQL Server available from Microsoft Corporation of Redmond, Wash.
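The role of a basis in an ad-hoc invocation can be sketched as follows: statistics gathered once over an existing corpus are reused to score the terms of a single ad-hoc document. The scoring used here is a simple term-frequency times inverse-document-frequency weighting over single words, chosen as an assumption for this sketch; the text above does not specify the statistical analysis, and the sample sentences are invented.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count in how many corpus documents each term appears (the 'basis')."""
    df = Counter()
    for text in corpus:
        df.update(set(text.lower().split()))
    return df


def key_terms(ad_hoc_text, basis, corpus_size, top_n=2):
    """Rank ad-hoc terms by term frequency times smoothed inverse document frequency."""
    tf = Counter(ad_hoc_text.lower().split())
    scores = {
        term: count * math.log((corpus_size + 1) / (basis.get(term, 0) + 1))
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]


corpus = [
    "the bike has a steel frame",
    "the bike has a carbon frame",
    "the helmet has a visor",
]
basis = document_frequencies(corpus)
top = key_terms("the bike has a titanium frame", basis, len(corpus), top_n=1)
```

Without the basis, every word of the ad-hoc document would look equally important; with it, words common across the corpus score near zero and the unusual word stands out.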

Some extractions produce output that may not be cleanly displayed as a single view. For example, consider an extraction that outputs an entity-relationship graph. One natural relational representation for such a graph is to normalize it over multiple tightly-related views. Some database systems implement collection units that can contain a plurality of views. For example, in SQL Server, a unit that can contain multiple views is referred to as a “schema”, not to be confused with schema as used in other contexts defining structure and content. Thus, when an extraction outputs multiple views, it may be persisted as a collection containing those views.
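A minimal sketch of normalizing an entity-relationship graph over related views is given here using Python's built-in sqlite3 module. SQLite has no SQL Server style schema, so a shared name prefix (semantics_) stands in for the collection; that substitution, and all table, view, and column names, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Backing tables that an extraction would populate.
CREATE TABLE extracted_entity (
    entity_id INTEGER PRIMARY KEY, name TEXT, document_id INTEGER);
CREATE TABLE extracted_relationship (
    source_id INTEGER, target_id INTEGER, document_id INTEGER);

-- Two tightly related views: graph nodes and graph edges.
CREATE VIEW semantics_entity AS
    SELECT entity_id, name FROM extracted_entity;
CREATE VIEW semantics_relationship AS
    SELECT s.name AS source_name, t.name AS target_name
    FROM extracted_relationship r
    JOIN extracted_entity s ON s.entity_id = r.source_id
    JOIN extracted_entity t ON t.entity_id = r.target_id;

INSERT INTO extracted_entity VALUES (1, 'Microsoft', 10), (2, 'Excel', 10);
INSERT INTO extracted_relationship VALUES (1, 2, 10);
""")

# Each view is independently queryable with ordinary SQL.
nodes = conn.execute("SELECT name FROM semantics_entity ORDER BY entity_id").fetchall()
edges = conn.execute(
    "SELECT source_name, target_name FROM semantics_relationship"
).fetchall()
```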

What follows are a number of use case examples. Embodiments may be implemented where a user determines which types of extractions are available. For example, a database may include a user interface that allows a user to query the extractions available and the properties that should be specified for those extractions. A user could submit a query that would cause the system to indicate that property promotion and phrase semantic extractions are available. Some embodiments include a command which allows a user to determine what extraction workflows are available. For example, in one very specific embodiment, the following command:

    SELECT * FROM sys.extraction_workflows;

produces the following table output:

    extraction_workflow_id  name                outputs_view  outputs_collection  is_global
    0                       Property_Promotion  1             0                   0
    1                       Phrase_Semantics    0             1                   1

This table illustrates the extraction workflows available (Property_Promotion and Phrase_Semantics), whether or not a workflow outputs views, whether or not the workflow outputs collections of views, and whether or not a workflow is global in that the output of the extraction depends on the corpus of documents it inputs rather than each document individually.

Embodiments may include functionality for allowing a user to check the options for extraction workflows. For example, in one specific embodiment, the following command:

    SELECT * FROM sys.extraction_workflow_options;

produces the following table output:

    extraction_workflow_id  option_id  name                    systype_id  type_desc      enumerated_values  Default_value  is_required
    0                       0          DOCUMENT PROPERTY LIST  256         property list  NULL               NULL           1
    1                       0          STOPLIST                256         stoplist       NULL               0              0

This table illustrates options for the extraction workflows in the preceding table.

Once extraction workflows are known, a user can specify a given workflow. The following illustrates an example of a user invoking property promotion extraction using the specific example tools illustrated above. In particular, in this example, for each document in a corpus 102, the extraction workflow 104 promotes properties into columns. For example, metadata in music files may be promoted to column entries. A user can then use database queries and properties included as part of a database to perform queries on the extracted properties.

First, a user creates a property list, for a property-scoped search. In the example illustrated, a user will change SEARCH PROPERTY LIST to DOCUMENT PROPERTY LIST. This is done with the following command:

    CREATE DOCUMENT PROPERTY LIST manualProperties . . . .

The user then invokes the property promotion extraction, persisting the output into a view. The following invocation, following the examples above, may be used:

    CREATE VIEW Production.DocumentProperties
        USING EXTRACTION Property_Promotion
        ON Production.ProductManual(File TYPE COLUMN FileType)
        KEY INDEX PK_ProductManual_ProductManualID
        WITH DOCUMENT PROPERTY LIST=manualProperties;

In the example above, the view name will be Production.DocumentProperties, the extraction workflow used to extract data on the corpus 102 will be Property_Promotion, the extraction will be performed on Production.ProductManual, and the extraction will be performed using the previously created manualProperties list.

As noted, once extraction has occurred, extracting the data into one or more views, a database query native to the database can be used to query the view. In some embodiments, the extraction may result in one property per column, with sparse columns and a column set. Once the view is created, native database operations can be performed. For example: Queries can be performed. The created view, views, or collection of views can be altered. The view and/or invocation can be dropped. Embodiments can check metadata and/or crawl status/information. Embodiments can check which columns are input for use with the extraction. Embodiments can check which options were set in an extraction.
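The one-property-per-column result can be sketched with Python's built-in sqlite3 module. SQLite has no sparse-column or column-set feature, so ordinary nullable columns stand in for them here; that substitution, the PROPERTY_LIST contents, and the sample metadata are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical metadata already read from two music files.
documents = {
    1: {"artist": "Artist A", "title": "Song One", "length_seconds": 215},
    2: {"artist": "Artist B", "title": "Song Two"},  # no length recorded
}
PROPERTY_LIST = ["artist", "title", "length_seconds"]

conn = sqlite3.connect(":memory:")
columns = ", ".join(PROPERTY_LIST)
conn.execute(f"CREATE TABLE promoted (document_id INTEGER PRIMARY KEY, {columns})")

placeholders = ", ".join("?" for _ in range(len(PROPERTY_LIST) + 1))
for doc_id, metadata in documents.items():
    # A property missing from a document is stored as NULL in its column.
    row = [doc_id] + [metadata.get(name) for name in PROPERTY_LIST]
    conn.execute(f"INSERT INTO promoted VALUES ({placeholders})", row)

conn.execute("CREATE VIEW document_properties AS SELECT * FROM promoted")

# Native queries now work against the promoted properties.
missing_length = conn.execute(
    "SELECT document_id FROM document_properties WHERE length_seconds IS NULL"
).fetchall()
```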

As noted, embodiments may facilitate ad-hoc invocation of extractions. The following illustrates an example of an ad-hoc property promotion extraction. In the following example, one embodiment declares an ad-hoc document, then applies the extraction to it. Applying the extraction is similar, and continues to use common relational concepts such as common table expressions.

The Property Promotion extraction is not global, meaning that it does not need the context of the rest of the corpus to work. This is specified by is_global in sys.extraction_workflows.

    DECLARE @D Document = 'file://...';
    WITH P AS (
        USING EXTRACTION Property_Promotion
        ON @D
        WITH DOCUMENT PROPERTY LIST = manualProperties
    )
    SELECT * FROM P;

The following example illustrates invoking a phrase semantics extraction workflow. In the present example, the phrase semantics workflow outputs a collection of multiple views. In the SQL Server example, this collection of multiple views may be exposed as a schema in SQL Server. In the present example, the collection is declared read-only because its contents are entirely controlled by the extraction invocation.

    CREATE SCHEMA ProductSemantics AS READ ONLY
        USING EXTRACTION Phrase_Semantics
        ON Production.ProductDescription(Description LANGUAGE 'English')
        KEY INDEX PK_ProductDescription_ProductDescriptionID
        WITH STOPLIST=myStoplist;

FIG. 2 illustrates a collection 200 created using the invocation.

The following illustrates another example where an ad-hoc invocation of a phrase semantics extraction is implemented. In the following example, an ad-hoc document is declared and then the extraction is applied to it. In this example, the ad-hoc document is represented by document_id NULL.

In the illustrated example, there are two interesting complications compared to property promotion. First, the phrase semantics extraction outputs a collection of views. To facilitate this, the clause used to create ad-hoc data in the RDBMS (e.g., the WITH clause as defined in SQL Server available from Microsoft Corporation of Redmond, Wash.) can be extended to allow a transient schema, similarly to how it currently allows a transient table. Second, phrase semantics, in the present example, is a global workflow, meaning it requires the context of the rest of the corpus. This is specified above by is_global in sys.extraction_workflows. Thus, an ad-hoc invocation uses an existing non-ad-hoc extraction as a basis. The following illustrates an example:

    DECLARE @D Document = 'file://...';
    WITH SCHEMA S AS (
        USING EXTRACTION Phrase_Semantics
        ON @D
        BASIS ProductSemantics
    )
    SELECT * FROM S.KeyPhrase WHERE document_id IS NULL;

The following discussion now refers to a number of methods and method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flow chart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed.

Referring now to FIG. 3, a method 300 is illustrated. The method 300 may be practiced in a computing environment. The method 300 includes acts for representing structured data extracted from unstructured data in a fashion allowing querying using relational database concepts. The method 300 includes receiving user input specifying one or more database views (act 302). Examples of this are illustrated above where a user uses the CREATE VIEW or CREATE SCHEMA clauses.

The method 300 further includes receiving user input specifying an information extraction technique (act 304). For example, a user can specify an extraction workflow. Examples of this are illustrated above where a user inputs a USING EXTRACTION clause.

The method 300 further includes receiving user input specifying a corpus of data (act 306). Examples of this are illustrated above by the use of the ON clauses.

The method 300 further includes applying the extraction technique to the corpus of data to produce the one or more database views (act 308). FIG. 1 illustrates applying an extraction technique (in this example, a workflow 104) to a corpus of data 102 to produce a view 106.
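The four acts of method 300 can be sketched end to end with Python's built-in sqlite3 module: the caller names a view (act 302), supplies an extraction technique (act 304), names a corpus table (act 306), and the technique is applied to produce the view (act 308). The function produce_view, the regular expression, and the table layout are hypothetical and are offered as one possible reading under stated assumptions, not as the claimed implementation.

```python
import re
import sqlite3

# Hypothetical extraction technique: pulls (city, temperature) from one sentence shape.
TEMPERATURE = re.compile(
    r"temperature in (?P<city>.+?) right now is (?P<temp>\d+) degrees"
)


def extract_temperature(text):
    match = TEMPERATURE.search(text)
    if match is None:
        return None
    return match.group("city"), int(match.group("temp"))


def produce_view(conn, view_name, technique, corpus_table):
    """Apply technique to each document in corpus_table and expose the results as a view."""
    backing = view_name + "_data"
    conn.execute(
        f"CREATE TABLE {backing} (document_id INTEGER, city TEXT, temperature INTEGER)"
    )
    for doc_id, body in conn.execute(f"SELECT id, body FROM {corpus_table}").fetchall():
        extracted = technique(body)
        if extracted is not None:
            conn.execute(f"INSERT INTO {backing} VALUES (?, ?, ?)", (doc_id, *extracted))
    conn.execute(
        f"CREATE VIEW {view_name} AS SELECT document_id, city, temperature FROM {backing}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO corpus VALUES (?, ?)", [
    (1, "The temperature in Rio de Janeiro right now is 82 degrees."),
    (2, "This document mentions no temperature at all."),
])

produce_view(conn, "city_temperature", extract_temperature, "corpus")
rows = conn.execute(
    "SELECT temperature FROM city_temperature WHERE city = 'Rio de Janeiro'"
).fetchall()
```

Identifiers are interpolated directly into the SQL text to keep the sketch short; code accepting untrusted names would need to validate them first.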

Further, the methods may be practiced by a computer system including one or more processors and computer readable media such as computer memory. In particular, the computer memory may store computer executable instructions that when executed by one or more processors cause various functions to be performed, such as the acts recited in the embodiments.

Embodiments of the present invention may comprise or utilize a special purpose or general-purpose computer including computer hardware, as discussed in greater detail below. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are physical storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: physical computer readable storage media and transmission computer readable media.

Physical computer readable storage media includes RAM, ROM, EEPROM, CD-ROM or other optical disk storage (such as CDs, DVDs, etc.), magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Physical computer readable storage media specifically excludes propagated signals.

A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.

Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer readable media to physical computer readable storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer readable physical storage media at a computer system. Thus, computer readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

What is claimed is:
1. In a computing environment, a method of representing structured data extracted from unstructured data in fashion allowing querying using relational database concepts, the method comprising:
receiving user input specifying one or more database views;
receiving user input specifying an information extraction technique;
receiving user input specifying a corpus of data comprising unstructured data;
applying the extraction technique to the corpus of data to extract structured data from the unstructured data of the corpus of data; and
producing the one or more database views including the extracted structured data.